 critical point


From Saddle Points Toward Global Minima: A Newton-Type Method on Wasserstein Space

arXiv.org Machine Learning

We study the minimization of non-convex functionals over the Wasserstein space. While recent work has shown that perturbed Wasserstein gradient methods can avoid saddle points for benign landscapes, existing approaches remain essentially first-order and do not provide fast local convergence once the iterates enter a neighborhood of a global minimizer. We propose Wasserstein Saddle-Free Newton (WSFN), a second-order method that preconditions the Wasserstein gradient by a regularized square root of the squared Wasserstein Hessian. This construction preserves attraction toward directions of positive curvature while inducing repulsion along directions of negative curvature, thereby overcoming the tendency of standard Wasserstein Newton dynamics to be attracted to saddles. We also establish second-order sufficient optimality conditions on Wasserstein space for strict local minimality. Under regularity and benign landscape assumptions, we prove that WSFN escapes saddle regions and reaches an $\alpha$-neighborhood of a global minimizer in polynomial time, with improved dependence on saddle parameters compared with prior perturbed first-order methods. Once inside this neighborhood, we show that WSFN converges linearly in $L^2$-Wasserstein distance to a non-degenerate global minimizer. Finally, we present a particle-based implementation of the method.
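The preconditioning idea in this abstract can be illustrated in ordinary two-dimensional Euclidean space. The sketch below is not the WSFN algorithm itself: it assumes a finite-dimensional analogue in which the gradient $g$ is rescaled by $(H^2 + \varepsilon I)^{-1/2}$, where $H$ is the Hessian and $\varepsilon$ is a regularization constant. The exact form of the regularization, the function names `saddle_free_step` and `demo`, and the test function are all assumptions made for illustration.

```python
import math

def saddle_free_step(grad, hess, eps=1e-3, lr=1.0):
    """One Euclidean saddle-free Newton step in 2D.

    Returns the update -lr * (H^2 + eps*I)^(-1/2) @ grad for a symmetric
    2x2 Hessian given as ((a, b), (b, c)). This is an assumed Euclidean
    analogue, not the Wasserstein-space method.
    """
    (a, b), (_, c) = hess
    # M = H^2 + eps*I is symmetric positive (semi)definite.
    m11 = a * a + b * b + eps
    m12 = b * (a + c)
    m22 = b * b + c * c + eps
    # Closed-form principal square root of a 2x2 SPD matrix:
    # sqrt(M) = (M + s*I) / t, with s = sqrt(det M), t = sqrt(tr M + 2s).
    s = math.sqrt(m11 * m22 - m12 * m12)
    t = math.sqrt(m11 + m22 + 2.0 * s)
    p11, p12, p22 = (m11 + s) / t, m12 / t, (m22 + s) / t
    # Solve P d = grad with Cramer's rule.
    det = p11 * p22 - p12 * p12
    g1, g2 = grad
    d1 = (p22 * g1 - p12 * g2) / det
    d2 = (p11 * g2 - p12 * g1) / det
    return (-lr * d1, -lr * d2)

def demo(steps=30, eps=0.1):
    """Iterate on f(x, y) = x^2 + y^4/4 - y^2/2 (illustrative choice).

    This f has a saddle at (0, 0) and minima at (0, 1) and (0, -1).
    """
    x, y = 1.0, 0.1  # start close to the saddle's attracting direction
    for _ in range(steps):
        grad = (2.0 * x, y ** 3 - y)
        hess = ((2.0, 0.0), (0.0, 3.0 * y * y - 1.0))
        dx, dy = saddle_free_step(grad, hess, eps=eps)
        x, y = x + dx, y + dy
    return x, y

if __name__ == "__main__":
    print(demo())
```

Because the preconditioner uses the absolute value of the curvature, the step along a negative-curvature direction points away from the saddle, whereas a plain Newton step $-H^{-1} g$ would point toward it. Along positive-curvature directions the two steps nearly coincide when $\varepsilon$ is small.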


Attention-based PCA

arXiv.org Machine Learning

We study attention mechanisms through the lens of a canonical unsupervised problem: principal component analysis (PCA). We show that, when trained on Gaussian data, both softmax and linear attention layers learn parameters that align with the principal eigenvectors of the covariance matrix, thereby establishing a direct and explicit connection with PCA. Our analysis covers both finite and infinite prompt regimes. In the infinite-prompt limit, we prove convergence to globally optimal solutions aligned with the leading spectral direction, while in the finite-prompt setting we show that the same behavior emerges up to sampling effects. We further extend the analysis to an in-context setting with spiked Wishart covariances, where attention successfully recovers the underlying signal direction. These results demonstrate that attention inherently performs PCA-like computations under unsupervised objectives, providing a theoretical foundation for its representation-learning capabilities.


f5ccb3ab757131a93586ef61ec701533-Supplemental-Conference.pdf

Neural Information Processing Systems

In this section, we compare the symmetric solutions found in erf [2] and ReLU networks [5] to our one-neuron solution (n = 1). The main difference is that both earlier studies constrain the search space to the symmetric subspace, whereas we first prove in Theorem 5.1 that the non-trivial critical points are contained in this subspace for a broad class of activation functions, including erf and ReLU. Solving the low-dimensional loss, we recover the same solution for ReLU and erf as in [2, 5] for unit-orthonormal teachers.


Should Under-parameterized Student Networks Copy or Average Teacher Weights?

Neural Information Processing Systems

Any continuous function f can be approximated arbitrarily well by a neural network with sufficiently many neurons k. We consider the case when f itself is a neural network with one hidden layer and k neurons. Approximating f with a neural network with n < k neurons can thus be seen as fitting an under-parameterized "student" network with n neurons to a "teacher" network with k neurons. As the student has fewer neurons than the teacher, it is unclear whether each of the n student neurons should copy one of the teacher neurons or rather average a group of teacher neurons. For shallow neural networks with erf activation function and for the standard Gaussian input distribution, we prove that "copy-average" configurations are critical points if the teacher's incoming vectors are orthonormal and its outgoing weights are unitary. Moreover, the optimum among such configurations is reached when n − 1 student neurons each copy one teacher neuron and the n-th student neuron averages the remaining k − n + 1 teacher neurons. For the student network with n = 1 neuron, we additionally provide a closed-form solution of the non-trivial critical point(s) for commonly used activation functions by solving an equivalent constrained optimization problem. Empirically, we find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron. Finally, we find similar results for the ReLU activation function, suggesting that the optimal solution of under-parameterized networks has a universal structure.






WiVi Y

Neural Information Processing Systems

See Appendix A.9 for the derivation of the Hessian. This proves statement (a). Notice that this is a contradiction because any point with $W_{L+1} = 0$ is in the set S. Hence, there exists no point at which the Hessian is negative semidefinite. Because negative semidefiniteness of the Hessian is a necessary condition for a local maximum, every critical point is then either a local minimum or a saddle point. This proves statement (b).
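The last step of this argument relies on the standard second-order necessary condition for a local maximum. As a reminder, it can be written as follows; the symbols $\mathcal{L}$ (a twice-differentiable loss) and $\theta^\ast$ (a parameter vector) are notation assumed here and do not come from the excerpt.

```latex
% Second-order necessary condition (standard result; notation assumed):
\theta^\ast \text{ is a local maximum of } \mathcal{L}
\;\Longrightarrow\;
\nabla \mathcal{L}(\theta^\ast) = 0
\quad\text{and}\quad
v^\top \nabla^2 \mathcal{L}(\theta^\ast)\, v \le 0 \ \ \text{for all } v .

% Contrapositive used in the argument:
\exists\, v:\ v^\top \nabla^2 \mathcal{L}(\theta^\ast)\, v > 0
\;\Longrightarrow\;
\theta^\ast \text{ is not a local maximum.}
```

So if the Hessian is negative semidefinite at no point, no critical point can be a local maximum, which leaves only local minima and saddle points.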


Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision

Neural Information Processing Systems

In the Vision-and-Language Navigation (VLN) task, an agent is asked to navigate inside 3D indoor environments following given instructions. Cross-modal alignment is one of the most critical challenges in VLN because the predicted trajectory needs to match the given instruction accurately. In this paper, we address the cross-modal alignment challenge from a fine-grained perspective. First, to alleviate weak cross-modal alignment supervision from coarse-grained data, we introduce a human-annotated fine-grained VLN dataset, namely Landmark-RxR. Second, to further enhance local cross-modal alignment under fine-grained supervision, we investigate focal-oriented rewards in soft and hard forms, focusing on the critical points sampled from the fine-grained Landmark-RxR. Moreover, to fully evaluate the navigation process, we also propose a re-initialization mechanism that makes metrics insensitive to difficult points, which can cause the agent to deviate from the correct trajectories. Experimental results show that our agent has superior navigation performance on Landmark-RxR, en-RxR and R2R.